Introduction to Open Data Science - Course Project

About the project

My git repository

date()
## [1] "Mon Nov 20 22:19:58 2023"

Heads-up: I am new with R. During this first week, I studied the 4 first chapters of the book made on R studio and watched the given tutorial videos.

I have learned how to:
- set my directory
- use the R project mode, the R markdown file and github commit
- install and use libraries
- identify and use the different data types including the factors
- use the pipe %>% (it means “and then…”)
- open data set, read data set, read the structure of the data set
- select rows, columns and match text or rename column name
- filter data based on variable values
- create a table / tibble
- create an object that is reusable later
- use the time format
- add columns with text, int, calculations, categories etc.
- group data with one or many variables to then perform calculation
- combine all data types to create a string with the function paste()
- check the NA and remove the NA to do calculations
- summarize to sum up data per variable value
- manipulate table from wide to long and long to wide
- combine 2 tables (functions: bind, inner_join, full_join, left_join, right_join)
- use the most common arithmetic functions and use percentage
- rank and reorder value
- create plots (bar, col, line, point, box, histogram)
- play with the different style and aesthetic possibilities
- add labels to plots
- combine plots together

With this tool I discovered how easy and fast it can be to manipulate and visualize data from a big data set.

Conclusion of Week 1:
The book is really well made. The style is easy to follow and understand. We can learn at our own pace and try the examples in R studio to assimilate the concepts well.
I am new with R and I find it hard to learn so many new different concepts in a week. I would have liked to have more exercises, without the solutions directly written under and with a tool to correct our lines of code. Finally, I understand it is not necessary to know by heart every function, but I would like to understand well the capabilities of this tool by practicing and doing more real life exercises.

Lines of code to summarize learning in week 1:

# Get the data set
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# This is a code I wrote to know the pourcentage of deaths per cause for each year.
exercise4 <- gbd_full %>% 
  group_by(year, cause) %>% 
  summarise(total_per_cause= sum(deaths_millions)) %>% 
  group_by(year) %>% 
  mutate(total_per_year =sum(total_per_cause)) %>% 
  mutate(percentage = percent(total_per_cause/total_per_year)) %>% 
  select(year,cause,percentage) %>% 
  pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
exercise4
## # A tibble: 7 × 4
## # Groups:   year [7]
##    year `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                   <chr>    <chr>                      
## 1  1990 33%                     9%       58%                        
## 2  1995 31%                     9%       60%                        
## 3  2000 29%                     9%       62%                        
## 4  2005 27%                     9%       64%                        
## 5  2010 24%                     9%       67%                        
## 6  2015 20%                     8%       72%                        
## 7  2017 19%                     8%       73%
# This is a code I wrote to know the number of deaths per sex and income and see the ratio Male / Female for each income type.
gbd_full %>% 
  filter(year ==1990) %>% 
  group_by(sex,income) %>% 
  summarize(deaths_per_sex_income = sum(deaths_millions)) %>% 
  pivot_wider(names_from = sex, values_from = deaths_per_sex_income) %>% 
  mutate(diff_M_F = Male/Female)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
##   income       Female  Male diff_M_F
##   <chr>         <dbl> <dbl>    <dbl>
## 1 High           4.14  4.46     1.08
## 2 Low            2.22  2.57     1.16
## 3 Lower-Middle   8.47  9.83     1.16
## 4 Upper-Middle   6.68  7.95     1.19

Chapter 2: Regression and model validation

date()
## [1] "Mon Nov 20 22:20:01 2023"
# set directory
setwd("~/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Thoughts about this week 2:

After reading all the chapters 1-7, I am now more confident to use R studio. I also understand the language better and I can do research on the web to use new function that I did not know.

It is very exciting to see how efficient is this tool and to think about all the analyzes we can do. I am an open university student and I can already see how to use this tool at work :).

Exercise: WRANGLING Please find the script file in the Github: create_learning2014_week2

Exercise: DATA ANALYSIS

QUESTION 1:

# Using the table from the course.
csv_table_read <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = T)

# library
library("tidyverse")
library("finalfit")
library("broom")


csv_table_read
##     gender age attitude     deep  stra     surf points
## 1        F  53      3.7 3.583333 3.375 2.583333     25
## 2        M  55      3.1 2.916667 2.750 3.166667     12
## 3        F  49      2.5 3.500000 3.625 2.250000     24
## 4        M  53      3.5 3.500000 3.125 2.250000     10
## 5        M  49      3.7 3.666667 3.625 2.833333     22
## 6        F  38      3.8 4.750000 3.625 2.416667     21
## 7        M  50      3.5 3.833333 2.250 1.916667     21
## 8        F  37      2.9 3.250000 4.000 2.833333     31
## 9        M  37      3.8 4.333333 4.250 2.166667     24
## 10       F  42      2.1 4.000000 3.500 3.000000     26
## 11       M  37      3.9 3.583333 3.625 2.666667     31
## 12       F  34      3.8 3.833333 4.750 2.416667     31
## 13       F  34      2.4 4.250000 3.625 2.250000     23
## 14       F  34      3.0 3.333333 3.500 2.750000     25
## 15       M  35      2.6 4.166667 1.750 2.333333     21
## 16       F  33      4.1 3.666667 3.875 2.333333     31
## 17       F  32      2.6 4.083333 1.375 2.916667     20
## 18       F  44      2.6 3.500000 3.250 2.500000     22
## 19       M  29      1.7 4.083333 3.000 3.750000      9
## 20       F  30      2.7 4.000000 3.750 2.750000     24
## 21       M  27      3.9 3.916667 2.625 2.333333     28
## 22       M  29      3.4 4.000000 2.375 2.416667     30
## 23       F  31      2.7 4.000000 3.625 3.000000     24
## 24       F  37      2.3 3.666667 2.750 2.416667      9
## 25       F  26      3.7 3.666667 1.750 2.833333     26
## 26       F  26      4.4 4.416667 3.250 3.166667     32
## 27       M  30      4.1 3.916667 4.000 3.000000     32
## 28       F  33      3.7 3.750000 3.625 2.000000     33
## 29       F  33      2.5 3.250000 2.875 3.500000     29
## 30       M  28      3.0 3.583333 3.000 3.750000     30
## 31       M  26      3.4 4.916667 1.625 2.500000     19
## 32       F  27      3.2 3.583333 3.250 2.083333     23
## 33       F  25      2.0 2.916667 3.500 2.416667     19
## 34       F  31      2.4 3.666667 3.000 2.583333     12
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 36       F  39      1.6 4.083333 1.875 2.833333     11
## 37       M  38      3.1 3.833333 4.375 1.833333     20
## 38       M  24      3.8 3.250000 3.625 2.416667     26
## 39       M  26      3.8 2.333333 2.500 3.250000     31
## 40       M  25      3.3 3.333333 1.250 3.416667     20
## 41       F  30      1.7 4.083333 4.000 3.416667     23
## 42       F  25      2.5 2.916667 3.000 3.166667     12
## 43       M  30      3.2 3.333333 2.500 3.500000     24
## 44       F  48      3.5 3.833333 4.875 2.666667     17
## 45       F  24      3.2 3.666667 5.000 2.416667     29
## 46       F  40      4.2 4.666667 4.375 3.583333     23
## 47       M  25      3.1 3.750000 3.250 2.083333     28
## 48       F  23      3.9 3.416667 4.000 3.750000     31
## 49       F  25      1.9 4.166667 3.125 2.916667     23
## 50       F  23      2.1 2.916667 2.500 2.916667     25
## 51       M  27      2.5 4.166667 3.125 2.416667     18
## 52       M  25      3.2 3.583333 3.250 3.000000     19
## 53       M  23      3.2 2.833333 2.125 3.416667     22
## 54       F  23      2.6 4.000000 2.750 2.916667     25
## 55       F  23      2.3 2.916667 2.375 3.250000     21
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 57       F  22      2.8 4.083333 4.000 2.333333     28
## 58       F  23      3.3 2.916667 4.000 3.250000     25
## 59       M  21      4.8 3.500000 2.250 2.500000     29
## 60       M  21      4.0 4.333333 3.250 1.750000     33
## 61       F  21      4.0 4.250000 3.625 2.250000     33
## 62       F  21      4.7 3.416667 3.625 2.083333     25
## 63       F  26      2.3 3.083333 2.500 2.833333     18
## 64       F  25      3.1 4.583333 1.875 2.833333     22
## 65       F  26      2.7 3.416667 2.000 2.416667     17
## 66       M  21      4.1 3.416667 1.875 2.250000     25
## 67       F  23      3.4 3.416667 4.000 2.833333     28
## 68       F  22      2.5 3.583333 2.875 2.250000     22
## 69       F  22      2.1 1.583333 3.875 1.833333     26
## 70       F  22      1.4 3.333333 2.500 2.916667     11
## 71       F  23      1.9 4.333333 2.750 2.916667     29
## 72       M  22      3.7 4.416667 4.500 2.083333     22
## 73       M  23      3.2 4.833333 3.375 2.333333     21
## 74       M  24      2.8 3.083333 2.625 2.416667     28
## 75       F  22      4.1 3.000000 4.125 2.750000     33
## 76       F  23      2.5 4.083333 2.625 3.250000     16
## 77       M  22      2.8 4.083333 2.250 1.750000     31
## 78       M  20      3.8 3.750000 2.750 2.583333     22
## 79       M  22      3.1 3.083333 3.000 3.333333     31
## 80       M  21      3.5 4.750000 1.625 2.833333     23
## 81       F  22      3.6 4.250000 1.875 2.500000     26
## 82       F  23      2.6 4.166667 3.375 2.416667     12
## 83       M  21      4.4 4.416667 3.750 2.416667     26
## 84       M  22      4.5 3.833333 2.125 2.583333     31
## 85       M  29      3.2 3.333333 2.375 3.000000     19
## 86       F  29      3.9 3.166667 2.750 2.000000     30
## 87       F  21      2.5 3.166667 3.125 3.416667     12
## 88       M  28      3.3 3.833333 3.500 2.833333     17
## 89       F  21      3.3 4.250000 2.625 2.250000     18
## 90       F  30      3.0 3.833333 3.375 2.750000     19
## 91       F  21      2.9 3.666667 2.250 3.916667     21
## 92       M  23      3.3 3.833333 3.000 2.333333     24
## 93       F  21      3.3 3.833333 4.000 2.750000     28
## 94       F  21      3.5 3.833333 3.500 2.750000     17
## 95       F  20      3.6 3.666667 2.625 2.916667     18
## 96       M  22      3.7 4.333333 2.500 2.083333     17
## 97       M  21      4.2 3.750000 3.750 3.666667     23
## 98       M  21      3.2 4.166667 3.625 2.833333     26
## 99       F  20      5.0 4.000000 4.125 3.416667     28
## 100      M  22      4.7 4.000000 4.375 1.583333     31
## 101      F  20      3.6 4.583333 2.625 2.916667     27
## 102      F  20      3.6 3.666667 4.000 3.000000     25
## 103      M  24      2.9 3.666667 2.750 2.916667     23
## 104      F  20      3.5 3.833333 2.750 2.666667     21
## 105      F  19      4.0 2.583333 1.375 3.000000     27
## 106      F  21      3.5 3.500000 2.250 2.750000     28
## 107      F  21      3.2 3.083333 3.625 3.083333     23
## 108      F  22      2.6 4.250000 3.750 2.500000     21
## 109      F  25      2.0 3.166667 4.000 2.333333     25
## 110      F  21      2.7 3.083333 3.125 3.000000     11
## 111      F  22      3.2 4.166667 3.250 3.000000     19
## 112      F  25      3.3 2.250000 2.125 4.000000     24
## 113      F  20      3.9 3.333333 2.875 3.250000     28
## 114      M  24      3.3 3.083333 1.500 3.500000     21
## 115      F  20      3.0 2.750000 2.500 3.500000     24
## 116      M  21      3.7 3.250000 3.250 3.833333     24
## 117      F  20      2.5 4.000000 3.625 2.916667     20
## 118      F  20      2.9 3.583333 3.875 2.166667     19
## 119      M  31      3.9 4.083333 3.875 1.666667     30
## 120      F  20      3.6 4.250000 2.375 2.083333     22
## 121      F  22      2.9 3.416667 3.000 2.833333     16
## 122      F  22      2.1 3.083333 3.375 3.416667     16
## 123      M  21      3.1 3.500000 2.750 3.333333     19
## 124      M  22      4.0 3.666667 4.500 2.583333     30
## 125      F  21      3.1 4.250000 2.625 2.833333     23
## 126      F  21      2.3 4.250000 2.750 3.333333     19
## 127      F  21      2.8 3.833333 3.250 3.000000     18
## 128      F  21      3.7 4.416667 4.125 2.583333     28
## 129      F  20      2.6 3.500000 3.375 2.416667     21
## 130      F  21      2.4 3.583333 2.750 3.583333     19
## 131      F  25      3.0 3.666667 4.125 2.083333     27
## 132      M  21      2.8 2.083333 3.250 4.333333     24
## 133      F  24      2.9 4.250000 2.875 2.666667     21
## 134      F  20      2.4 3.583333 2.875 3.000000     20
## 135      M  21      3.1 4.000000 2.375 2.666667     28
## 136      F  20      1.9 3.333333 3.875 2.166667     12
## 137      F  20      2.0 3.500000 2.125 2.666667     21
## 138      F  18      3.8 3.166667 4.000 2.250000     28
## 139      F  21      3.4 3.583333 3.250 2.666667     31
## 140      F  19      3.7 3.416667 2.625 3.333333     18
## 141      F  21      2.9 4.250000 2.750 3.500000     25
## 142      F  20      2.3 3.250000 4.000 2.750000     19
## 143      M  21      4.1 4.416667 3.000 2.000000     21
## 144      F  20      2.7 3.250000 3.375 2.833333     16
## 145      F  21      3.5 3.916667 3.875 3.500000      7
## 146      F  20      3.4 3.583333 3.250 2.500000     21
## 147      F  18      3.2 4.500000 3.375 3.166667     17
## 148      M  22      3.3 3.583333 4.125 3.083333     22
## 149      F  22      3.3 3.666667 3.500 2.916667     18
## 150      M  24      3.5 2.583333 2.000 3.166667     25
## 151      F  19      3.2 4.166667 3.625 2.500000     24
## 152      F  20      3.1 3.250000 3.375 3.833333     23
## 153      F  20      2.8 4.333333 2.125 2.250000     23
## 154      F  17      1.7 3.916667 4.625 3.416667     26
## 155      M  19      1.9 2.666667 2.500 3.750000     12
## 156      F  20      3.5 3.083333 2.875 3.000000     32
## 157      F  20      2.4 3.750000 2.750 2.583333     22
## 158      F  20      2.1 4.166667 4.000 3.333333     20
## 159      F  20      2.9 4.166667 2.375 2.833333     21
## 160      F  19      1.9 3.250000 3.875 3.000000     23
## 161      F  19      2.0 4.083333 3.375 2.833333     20
## 162      F  22      4.2 2.916667 1.750 3.166667     28
## 163      M  35      4.1 3.833333 3.000 2.750000     31
## 164      F  18      3.7 3.166667 2.625 3.416667     18
## 165      F  19      3.6 3.416667 2.625 3.000000     30
## 166      M  21      1.8 4.083333 3.375 2.666667     19
# analyze the structure of the dataset
str(csv_table_read)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# analyze the dimension of the dataset
dim(csv_table_read)
## [1] 166   7
# Missing data? No data is missing.
ff_glimpse(csv_table_read)
## $Continuous
##             label var_type   n missing_n missing_percent mean  sd  min
## age           age    <int> 166         0             0.0 25.5 7.8 17.0
## attitude attitude    <dbl> 166         0             0.0  3.1 0.7  1.4
## deep         deep    <dbl> 166         0             0.0  3.7 0.6  1.6
## stra         stra    <dbl> 166         0             0.0  3.1 0.8  1.2
## surf         surf    <dbl> 166         0             0.0  2.8 0.5  1.6
## points     points    <int> 166         0             0.0 22.7 5.9  7.0
##          quartile_25 median quartile_75  max
## age             21.0   22.0        27.0 55.0
## attitude         2.6    3.2         3.7  5.0
## deep             3.3    3.7         4.1  4.9
## stra             2.6    3.2         3.6  5.0
## surf             2.4    2.8         3.2  4.3
## points          19.0   23.0        27.8 33.0
## 
## $Categorical
##         label var_type   n missing_n missing_percent levels_n levels
## gender gender    <chr> 166         0             0.0        2      -
##        levels_count levels_percent
## gender            -              -
# summary statistics for each variable
missing_glimpse(csv_table_read)
##             label var_type   n missing_n missing_percent
## gender     gender    <chr> 166         0             0.0
## age           age    <int> 166         0             0.0
## attitude attitude    <dbl> 166         0             0.0
## deep         deep    <dbl> 166         0             0.0
## stra         stra    <dbl> 166         0             0.0
## surf         surf    <dbl> 166         0             0.0
## points     points    <int> 166         0             0.0
# Count per gender and percentage male / female
library("scales")
csv_table_read %>% 
  count(gender) %>% 
  mutate(total_percentage = n / nrow(csv_table_read)) %>% 
  mutate(total_percentage2 = percent(total_percentage))
##   gender   n total_percentage total_percentage2
## 1      F 110        0.6626506               66%
## 2      M  56        0.3373494               34%
# Mean and median for exercises points, and learning method per gender
summary(csv_table_read)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# The age varies from 17 to 55, mean is 25 and median 22. it suggests that there are some relatively higher values in the dataset
# The attitude varies from 1.4 to 5
# The points are from 7 to 33 and the mean is 22 and the median is 23. It suggests that there are some relatively lower values in the dataset
# we analyze the variables for both genders and females

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(csv_table_read[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(csv_table_read, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

# some data shows that there could be correlation between some variables

# Relationship between points and attitudes
csv_table_read %>% 
  ggplot(aes(x= attitude, y= points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

QUESTION 2, 3 and 4:

Female model

#female model
female_data <- csv_table_read %>% 
  filter(gender == "F")

View(female_data)


# Fit a multiple linear model for females. Let's check how points are influenced by age, attitude and deep learning approach
female_fitmodel <- lm(points ~ age + attitude + deep, data = female_data)
# In this model I want to check if age, attitude and deep impact points without impacting each other. 

# summary of std, p value and 
summary(female_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = female_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.058  -3.263   0.622   4.003  10.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.59029    4.28982   3.168  0.00201 ** 
## age         -0.01743    0.06983  -0.250  0.80338    
## attitude     3.40151    0.70837   4.802 5.19e-06 ***
## deep        -0.27355    0.96270  -0.284  0.77685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.353 on 106 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1559 
## F-statistic:  7.71 on 3 and 106 DF,  p-value: 0.0001043
summary(female_fitmodel)$r.squared
## [1] 0.1791183
#"age," "attitude," and "deep" explains about 18% of the variation of "points"

# p value intercept: it is significant as very small (0.002) and seems to play a significant role in the regression model
# baseline of model in 13.59 (estimate), when no factors are taken into account.
# age is not significant and is not correlated with points
# deep is not significant and is not correlated with points
# attitude is significant and it seems to play a significant role on the points.
# for one point increase in the attitude, the points increase by 3.63 (estimate)
# High std shows that the estimate is not so precise. It could due to sample size.

# I decide to drop the deep and the age variables and keep only the attitude.
female_fitmodel2 <- lm(points ~ attitude, data = female_data)
summary(female_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = female_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0557  -3.3486   0.6137   3.9819  10.3668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.194      2.156   5.655 1.29e-07 ***
## attitude       3.389      0.701   4.835 4.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 108 degrees of freedom
## Multiple R-squared:  0.1779, Adjusted R-squared:  0.1703 
## F-statistic: 23.38 on 1 and 108 DF,  p-value: 4.442e-06
tidy(female_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)    12.2      2.16       5.66 0.000000129
## 2 attitude        3.39     0.701      4.83 0.00000444
summary(female_fitmodel2)$r.squared
## [1] 0.1779266
# p value is very low, same for the std, so this model is correct and justify the positive relation vs a positive attitude -> more points.
# rsquare is still quite low..
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. Is could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot doesnt knit.
#autoplot(female_fitmodel)
#autoplot(female_fitmodel2)

# I use plot function and which to get the desired plots.
plot(female_fitmodel,which = c(1,2,5))

plot(female_fitmodel2,which = c(1,2,5))

# we observe non normality at the end and beginning of the line in qq plot
# both models show that there are some points that are high leverage indicated on the residuals vs leverage

Male model

# male model
male_data <- csv_table_read %>% 
  filter(gender == "M")

View(male_data)
summary(male_data)
##     gender               age          attitude          deep      
##  Length:56          Min.   :19.0   Min.   :1.700   Min.   :2.083  
##  Class :character   1st Qu.:21.0   1st Qu.:3.100   1st Qu.:3.396  
##  Mode  :character   Median :24.0   Median :3.400   Median :3.792  
##                     Mean   :26.8   Mean   :3.443   Mean   :3.725  
##                     3rd Qu.:29.0   3rd Qu.:3.900   3rd Qu.:4.083  
##                     Max.   :55.0   Max.   :4.800   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 9.00  
##  1st Qu.:2.375   1st Qu.:2.312   1st Qu.:20.00  
##  Median :3.000   Median :2.625   Median :23.50  
##  Mean   :2.964   Mean   :2.704   Mean   :23.48  
##  3rd Qu.:3.531   3rd Qu.:3.167   3rd Qu.:28.25  
##  Max.   :4.500   Max.   :4.333   Max.   :33.00
# Fit a multiple linear model for males. Let's check how points are influenced by age, attitude and deep learning approach
male_fitmodel <- lm(points ~ age + attitude + deep, data = male_data)

# summary of std, p value and 
summary(male_fitmodel) 
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.8084  -3.3162  -0.0696   3.2195   9.9927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.21897    6.06857   3.002 0.004112 ** 
## age         -0.16602    0.08456  -1.963 0.054974 .  
## attitude     4.31829    1.11699   3.866 0.000309 ***
## deep        -1.38378    1.22006  -1.134 0.261916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.271 on 52 degrees of freedom
## Multiple R-squared:  0.2718, Adjusted R-squared:  0.2298 
## F-statistic:  6.47 on 3 and 52 DF,  p-value: 0.000835
tidy(male_fitmodel)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.2      6.07        3.00 0.00411 
## 2 age           -0.166    0.0846     -1.96 0.0550  
## 3 attitude       4.32     1.12        3.87 0.000309
## 4 deep          -1.38     1.22       -1.13 0.262
summary(male_fitmodel)$r.squared 
## [1] 0.2718164
# similar results than for the female. 
# All variables have a smaller p value than for in the female model. 
# rsquare is higher as it explains 27% but it is still quite low. It could be due to the sample size.

# I decide to drop the deep and the age as variables and keep only the attitude.
male_fitmodel2 <- lm(points ~ attitude, data = male_data)
summary(male_fitmodel2) 
## 
## Call:
## lm(formula = points ~ attitude, data = male_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6535  -2.9073  -0.5121   3.6974  10.2106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.061      3.953   2.292  0.02581 *  
## attitude       4.189      1.129   3.711  0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.411 on 54 degrees of freedom
## Multiple R-squared:  0.2032, Adjusted R-squared:  0.1884 
## F-statistic: 13.77 on 1 and 54 DF,  p-value: 0.0004897
tidy(male_fitmodel2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     9.06      3.95      2.29 0.0258  
## 2 attitude        4.19      1.13      3.71 0.000490
summary(male_fitmodel)$r.squared 
## [1] 0.2718164
# p value is very low, same for the std, so this model is correct and justify the positive relation vs a positive attitude -> more points.
# rsquare is higher as it explains 27% but it is still quite low
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. Is could be due to the sample size.

# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points

# autoplot doesnt knit !!
# autoplot(male_fitmodel)
# autoplot(male_fitmodel2) 

# plot with the plot function
plot(male_fitmodel,which = c(1,2,5))

plot(male_fitmodel2,which = c(1,2,5))

#The red line in residuals vs fitted stays quite close to the 0 line which is good
# both models show non normality. it is observed at the beginning of the qq plot
# both models show that there are some points that are high leverage indicated on the residuals vs leverage

other models tested:

test_fit1 <- csv_table_read %>% 
  lm(points ~ deep, data = .)
library(ggfortify)
summary(test_fit1)
## 
## Call:
## lm(formula = points ~ deep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967
tidy(test_fit1) # p value is small and significant
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   23.1       3.09      7.48  4.31e-12
## 2 deep          -0.108     0.831    -0.130 8.97e- 1
summary(test_fit1)$r.squared  # too low
## [1] 0.0001029919
test_fit2 <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit2)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
tidy(test_fit2) # p value is small and significant
## # A tibble: 4 × 5
##   term         estimate std.error statistic      p.value
##   <chr>           <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   22.4         3.92   5.71    0.0000000533
## 2 deep          -0.0100      1.06  -0.00944 0.992       
## 3 genderM        2.67        6.42   0.417   0.677       
## 4 deep:genderM  -0.408       1.71  -0.238   0.812
summary(test_fit2)$r.squared # too low
## [1] 0.009220341

Basic data

# Female vs Male participants
csv_table_read %>% 
  ggplot(aes(x=gender)) +
  geom_bar()

# age chart and gender per age
csv_table_read %>% 
  ggplot(aes(x= age, fill = gender)) +
  geom_bar()

# age chart distribution per gender
csv_table_read %>% 
  ggplot(aes(x= age)) +
  facet_grid(~gender) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# age box plot distribution per gender
csv_table_read %>% 
  ggplot(aes(x= gender, y=age)) +
  geom_boxplot()

# relationship and distribution between age, points, and gender
csv_table_read %>% 
  ggplot(aes(y = points, x = age, colour = gender)) +
  geom_point() +
  labs(title = "Distribution of points per age and gender")

# with this data we can observe the different age points that drives the mean up (vs the median).

Gender analysis

Gender and points

# Distribution of the points per gender - histogram
csv_table_read %>% 
  ggplot(aes(x = points)) +
  geom_histogram() +
  facet_grid(~gender) +
  labs(title = "Histogram of points by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Distribution of the points per gender - boxplot
csv_table_read %>% 
  ggplot(aes(y = points, x = gender, colour = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of points by Gender")

#QQ plot - points per gender
csv_table_read %>% 
  ggplot(aes(sample = points)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean points per gender - this is not significant
csv_table_read %>%               
  t.test(points ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  points by gender
## t = -1.1832, df = 107.84, p-value = 0.2393
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -3.0896905  0.7799502
## sample estimates:
## mean in group F mean in group M 
##        22.32727        23.48214

Gender and attitude

# attitude vs gender
csv_table_read %>% 
  ggplot(aes(x=gender, y= attitude)) +
  geom_boxplot()

# Type histogram
csv_table_read %>% 
  ggplot(aes(x = attitude)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of attitude by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# QQ plot: attitude per gender
csv_table_read %>% 
  ggplot(aes(sample = attitude)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean attitude per gender - This is significant and shows a difference between F and M on deep
csv_table_read %>%               
  t.test(attitude ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  attitude by gender
## t = -4.0932, df = 122.66, p-value = 7.657e-05
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.6718635 -0.2338508
## sample estimates:
## mean in group F mean in group M 
##        2.990000        3.442857

Gender and deep learning approach:

# deep learning approach vs gender
# We could do that for all approach of learning
# Type histogram
csv_table_read %>% 
  ggplot(aes(x = deep)) +
  geom_histogram() +
  facet_grid(~ gender) +
  labs(title = "Histogram of deep approach by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Type boxplot
csv_table_read %>% 
  ggplot(aes(y = deep, x = gender, fill = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of deep Approach by Gender")

# QQ plot: deep per gender
csv_table_read %>% 
  ggplot(aes(sample = deep)) +      
  geom_qq() +                          
  geom_qq_line(colour = "blue") +      
  facet_grid(~gender)

# mean deep per gender - This is quite significant and could show a correlation between the gender and the approach deep
csv_table_read %>%               
  t.test(deep ~ gender, data = .)
## 
##  Welch Two Sample t-test
## 
## data:  deep by gender
## t = -0.72082, df = 101.32, p-value = 0.4727
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.2546963  0.1189279
## sample estimates:
## mean in group F mean in group M 
##        3.656818        3.724702

Age Analysis

Age and points

# does not seem to impact on points
csv_table_read %>% 
  ggplot(aes(x= age, y=points)) +
  geom_point()+ 
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Age and attitude

# does not seem to impact on attitude
csv_table_read %>% 
  ggplot(aes(x= age, y=attitude)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Learning approach: deep analysis

deep and age

# deep learning approach vs age - no correlation
# We could do that for all approach of learning
csv_table_read %>% 
  ggplot(aes(x= age, y= deep)) +
  geom_point()  +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

deep and points

# deep learning approach vs points - The deep approach seems to have a correlation with the number of points
csv_table_read %>% 
  ggplot(aes(x= deep, y=points)) +
  geom_point() +
  facet_wrap(~ gender) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

test_fit <- csv_table_read %>% 
  lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit)
## 
## Call:
## lm(formula = points ~ deep * gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3247  -3.3338   0.3369   4.6242  10.6787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.36387    3.91828   5.708 5.33e-08 ***
## deep         -0.01001    1.06032  -0.009    0.992    
## genderM       2.67476    6.41719   0.417    0.677    
## deep:genderM -0.40787    1.71487  -0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared:  0.00922,    Adjusted R-squared:  -0.009127 
## F-statistic: 0.5025 on 3 and 162 DF,  p-value: 0.6811
# deep seems to have a significant impact on "points."

Chapter 3

setwd("~/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

library(broom)
library(ggplot2)
library(dplyr)
data <- read.csv("alc.csv",sep = ",", header = T)

class(data$high_use)
## [1] "logical"
class(data$famrel)
## [1] "integer"
class(data$goout)
## [1] "integer"
class(data$studytime)
## [1] "integer"

Question 3, 4 and 5

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

Hypothesis 1: students with a good family relationship have less chance to have a high consumption of alcohol

Comments on the below codes chunk: There is a negative correlation between family relationship and the consumption of alcohol. When the relationship is not good (from 1 to 5), there are more risk to drink lot of alcohol. Pvalue is 0,0204 which is representative(<0.05). The coefficient for the variable Famrel in this model is -0.2838. The odds ratio associated with a one-unit change in Famrel is approximately exp(-0.2838) = 0.753. For a one-unit decrease in Famrel, the odds of ‘high_use’ decrease by about 1 - 0.753 = 24.7%. With 95% confidence, the true value of the odds ratio for the intercept lies between approximately 0.49 and 3.35. For the famrel, the odds ratio lies between 0.59 and 0.95, since it is less than 1, there is a statistical significance, with a negative effect from the variable famrel to the high_use. The null deviance is bigger than the residual deviance of the model. It shows that the model explains a part of the variable high-use.

# I add the family = binomial as we do the analyze on a binary variable (True and False)
logistic_model_family_alcohol <-  glm(high_use ~ famrel, data = data, family = binomial)
summary(logistic_model_family_alcohol)
## 
## Call:
## glm(formula = high_use ~ famrel, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2573     0.4853   0.530   0.5960  
## famrel       -0.2838     0.1224  -2.318   0.0204 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 446.67  on 368  degrees of freedom
## AIC: 450.67
## 
## Number of Fisher Scoring iterations: 4
# get the coef
coef(logistic_model_family_alcohol) %>% exp()
## (Intercept)      famrel 
##   1.2934265   0.7528865
# odd of being in high use - exponential of the coef
exp(coef(logistic_model_family_alcohol))
## (Intercept)      famrel 
##   1.2934265   0.7528865
# get the intervals
confint(logistic_model_family_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.4961949 3.3547129
## famrel      0.5911258 0.9570208
# get the variance, BIC and AIC
logistic_model_family_alcohol %>% 
  glance()
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          452.     369  -223.  451.  458.     447.         368   370
# If I want to check the impact of high use to family relationship -> famrel is ordinal so I use the ordinal regression model
library(ordinal)
## 
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
## 
##     slice
ordinal_model_family_alcohol <- clm(factor(famrel) ~ high_use, data = data)
summary(ordinal_model_family_alcohol)
## formula: factor(famrel) ~ high_use
## data:    data
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  370  -454.70 919.40 5(0)  3.77e-08 2.0e+01
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## high_useTRUE  -0.5411     0.2139  -2.529   0.0114 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -4.0079     0.3672 -10.914
## 2|3  -2.7746     0.2194 -12.645
## 3|4  -1.3118     0.1422  -9.222
## 4|5   0.8468     0.1300   6.516
# This box plot confirms our hypothesis and conclusion. We can see that the mean and median of the family relationship quality is very different whether the students drink a lot or not. 
data %>% 
  ggplot(aes(x= factor(high_use), y= famrel)) +
  geom_boxplot() +
  labs(title = "Quality of family relationship vs the alcohol consumption",
       x = "High consumption",
       y = "quality of family relationship")

data %>%
  ggplot(aes(x = as.factor(famrel), fill = as.factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Family relationship and Alcohol Consumption",
    x = "Family relation score from 1 to 5",
    y = "Proportion of students",
    fill = "High Use of alcohol: True of False"
  )

# there are less score 5 and way more score 2 among the students who drink a lot
data %>%
  ggplot(aes(x = as.factor(high_use), fill = as.factor(famrel))) +
  geom_bar(position = "fill") +
  labs(
    title = "Family relation score distribution by Alcohol Consumption",
    x = "High Use of alcohol: True of False",
    y = "Proportion of students",
    fill = "Family relation score from 1 to 5"
  )

Hypothesis 2: students who go more out with their friends have more risk to have a high consumption of alcohol

Comments on the below codes chunk: There is a a positive relationship between going out and the consumption of alcohol and a significant p value (very small value), which confirms the hypothesis. The more students go out the more risks they have to have a high consumption in alcohol.

The coefficient for goout is 0.7563.The odds ratio associated with a one-unit change in Goout is approximately exp(0.7563) = 2.13.For a one-unit increase in Goout, the odds of ‘high_use’ increase by about 2.13 - 1 = 1.13, which is about 113% increase.

With 95% confidence, the true value of the odds ratio for the intercept lies between approximately 0.015 and 0.078. For the goout, the odds ratio lies between 1.71 and 2.66, since it is > 1, there is a statistical significance with a positive effect of this variable on the high_use.

The null deviance is way bigger than the residual deviance of the model. It shows that the model explains the variable high-use, even better than in the hypothesis 1. The AIC of this model is even smaller than the previous model too, which confirms my conclusion.

logistic_model_goout_alcohol <- glm(high_use ~ goout, data = data, family = binomial)
summary(logistic_model_goout_alcohol)
## 
## Call:
## glm(formula = high_use ~ goout, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3368     0.4188  -7.968 1.62e-15 ***
## goout         0.7563     0.1165   6.494 8.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 403.05  on 368  degrees of freedom
## AIC: 407.05
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_goout_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.34      0.419     -7.97 1.62e-15
## 2 goout          0.756     0.116      6.49 8.34e-11
# get the coef
coef(logistic_model_goout_alcohol) %>% exp()
## (Intercept)       goout 
##  0.03554946  2.13048020
# get the exp of the coef
exp(coef(logistic_model_goout_alcohol))
## (Intercept)       goout 
##  0.03554946  2.13048020
# get the intervals
confint(logistic_model_goout_alcohol) %>% exp()
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) 0.01515176 0.07850224
## goout       1.70566958 2.69518608
# plots
data %>% 
  ggplot(aes(x= factor(high_use), y= goout)) +
  geom_boxplot() +
  labs(title = "Going out with friends vs the alcohol consumption",
       x = "High consumption",
       y = "go out with friends (from 1 to 5)")

# the students who are drinking more are more numerous among the students who gave the score 4 and 5 for the goout variable.
data %>%
  ggplot(aes(x = factor(goout), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Going Out Behavior and Alcohol Consumption",
    x = "going out with friends from 1 to 5",
    y = "Proportion of students",
    fill = "High Use of alcohol: True of False"
  )

# there are more score 1, 2 and 3 among the students who do not drink a lot
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(goout))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and going out behavior",
    x = "High Use of alcohol: True of False",
    y = "Proportion of students",
    fill = "going out with friends from 1 to 5"
  )

Hypothesis 3: students who study a lot have less risk to have a high consumption of alcohol

Comments about the below code chunk: The p value is very small and estimate is negative. We can conclude there is a negative relation between the study time and the alcohol consumption. When the study time increases the odds of high alcohol consumption decreases.

The coefficient for Studytime is -0.6208. The odds ratio associated with a one-unit change in Studytime is approximately exp(-0.6208) = 0.538. For a one-unit decrease in Studytime, the odds of ‘high_use’ decrease by about 1 - 0.538 = 46.2%.

With 95% confidence, the true value of the odds ratio for the intercept lies between approximately 0.79 and 2.67. For the study time, the odds ratio lies between 0.39 and 0.72, since it is < 1, there is a statistical significance with a negative effect of this variable on the high_use.

The residual deviance is smaller than the null deviance, it means the model explains the variable high use. The AIC is smaller than the first model but not as small as the second model.

logistic_model_studytime_alcohol <- glm(high_use ~ studytime, data = data, family = binomial)
summary(logistic_model_studytime_alcohol)
## 
## Call:
## glm(formula = high_use ~ studytime, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.3649     0.3102   1.176    0.239    
## studytime    -0.6208     0.1542  -4.027 5.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 433.82  on 368  degrees of freedom
## AIC: 437.82
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_studytime_alcohol)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    0.365     0.310      1.18 0.239    
## 2 studytime     -0.621     0.154     -4.03 0.0000566
# get the coef
coef(logistic_model_studytime_alcohol) %>% exp()
## (Intercept)   studytime 
##   1.4403849   0.5375316
#Get the odds - exponential of coefficient: 
exp(coef(logistic_model_studytime_alcohol))
## (Intercept)   studytime 
##   1.4403849   0.5375316
# get the intervals
confint(logistic_model_studytime_alcohol) %>% exp()
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.7881099 2.6654451
## studytime   0.3933727 0.7208174
# This plot shows the difference in median and mean for the students who drink a lot or not
data %>% 
  ggplot(aes(x= factor(high_use), y= studytime)) +
  geom_boxplot() +
  labs(title = "Study time vs the alcohol consumption",
       x = "High consumption",
       y = "Study time (from 1 to 4)")

# way more students among students who score they study time around 1 and 2
data %>%
  ggplot(aes(x = factor(studytime), fill = factor(high_use))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Study time Behavior and Alcohol Consumption",
    x = "Study time from 1 to 4",
    y = "Proportion of students",
    fill = "High Use of alcohol: True of False"
  )

# way more score 1 and way less 3 and 4 among students who drink a lot.
data %>%
  ggplot(aes(x = factor(high_use), fill = factor(studytime))) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Study time behavior",
    x = "High Use of alcohol: True of False",
    y = "Proportion of students",
    fill = "Study time from 1 to 4"
  )

Hypothesis 4: students who chose the school because of its reputation has less risk to have a high consumption of alcohol

Comments about the below code: The p value is 0,022 which tells that there is a correlation between the reason to choose a school and the consumption of alcohol.

We don’t know with the test which reason is the most chosen by the students who drink a lot of alcohol vs the ones who don’t. Some plots can help us identify it (check below).

# Creating a contingency table
chi_table_alcohol_reasons <- table((data$high_use), as.factor(data$reason))

# Performing the chi-squared test
chi_square_test <- chisq.test(chi_table_alcohol_reasons )
chi_square_test
## 
##  Pearson's Chi-squared test
## 
## data:  chi_table_alcohol_reasons
## X-squared = 9.563, df = 3, p-value = 0.02267
# Summary of the test
summary(chi_square_test)
##           Length Class  Mode     
## statistic 1      -none- numeric  
## parameter 1      -none- numeric  
## p.value   1      -none- numeric  
## method    1      -none- character
## data.name 1      -none- character
## observed  8      table  numeric  
## expected  8      -none- numeric  
## residuals 8      table  numeric  
## stdres    8      table  numeric
# This show that students who don't drink a lot have chosen their school more for its reputation than the one who drink a lot. For the ones who drink, the reasons home and other are way more important than for the students who don't drink too much. The reason courses seem to be equally important for them.
data %>%
  ggplot(aes(x = high_use, fill = reason)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Reason to chose their school and Alcohol Consumption",
    x = "High Consumption of a",
    y = "Proportion of students",
    fill = "High Use of alcohol: True of False"
  )

data %>%
  ggplot(aes(x = reason, fill = high_use)) +
  geom_bar(position = "fill") +
  labs(
    title = "Relationship Between Alcohol Consumption and Reason to chose their school",
    x = "High Use of alcohol: True of False",
    y = "Proportion of students",
    fill = "Reason to choose the school"
  )

Other models to test:

By adding each variable together, we can see that the variation in high use is better explained (this is because the deviance is decreasing from 452 to 379).

All three variables (goout, famrel, and studytime) have statistical significance in predicting alcohol consumption (high_use).

logistic_model_3var <- glm(high_use ~ goout + famrel + studytime, data = data, family = binomial)
logistic_model_3var_2 <- glm(high_use ~ goout * famrel * studytime, data = data, family = binomial)
summary(logistic_model_3var)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + studytime, family = binomial, 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7286     0.6941  -1.050  0.29381    
## goout         0.7777     0.1203   6.467 9.99e-11 ***
## famrel       -0.3777     0.1367  -2.763  0.00573 ** 
## studytime    -0.6139     0.1672  -3.673  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 379.74  on 366  degrees of freedom
## AIC: 387.74
## 
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_3var)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.729     0.694     -1.05 2.94e- 1
## 2 goout          0.778     0.120      6.47 9.99e-11
## 3 famrel        -0.378     0.137     -2.76 5.73e- 3
## 4 studytime     -0.614     0.167     -3.67 2.40e- 4
# Anova conclusion: The p-value associated with the difference in deviance (Pr(>Chi)) is 0.1362, suggesting that the difference in fit between the two models is not statistically significant
anova(logistic_model_3var, logistic_model_3var_2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ goout * famrel * studytime
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       366     379.74                     
## 2       362     372.74  4    6.995   0.1362
# when comparing with for instanc: logistic_model_family_alcohol, the residual deviance of logistic_model_3var is better and the p is significant. 
anova(logistic_model_3var, logistic_model_family_alcohol, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ famrel
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       366     379.74                          
## 2       368     446.67 -2  -66.934 2.921e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# best model is the logistic_model_3var according to BIC and AIC, because metrics are smaller for this model
AIC(logistic_model_3var, logistic_model_3var_2)
##                       df      AIC
## logistic_model_3var    4 387.7372
## logistic_model_3var_2  8 388.7422
BIC(logistic_model_3var, logistic_model_3var_2)
##                       df      BIC
## logistic_model_3var    4 403.3912
## logistic_model_3var_2  8 420.0502

question 6: Model 1 (high_use ~ goout + famrel + studytime)

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

# prediction based on the model: logistic_model_3var
predictions <-  predict(logistic_model_3var, type = "response")

#Create the confusion matrix: 
table(data$high_use, predictions > 0.5)
##        
##         FALSE TRUE
##   FALSE   235   24
##   TRUE     63   48
# correct are False false + true true = 235 + 48 = 293
# wrong are False True and True False = 24 + 63 = 87

# Probability of miss classifications  = 87/(293+87) = 23% 
# odds of miss classification = 87/293 = 30%

# precision: correctly positive among all positives predictions = 0.33%
precision <- 24 / (24 + 48)
# recall: correctly positive among all actual positives = 0.28%
recall <- 24 / (24 + 63)
# F1 score: 0.30%
 2 * (precision * recall) / (precision + recall)
## [1] 0.3018868
 # Accuracy of always predicting the majority class:  if the majority class is "FALSE," then predicting "FALSE" for every instance would result in a correct prediction rate of 70%
mean(data$high_use == FALSE)
## [1] 0.7
# Calculate the total proportion of errors is 23.5%. This means my model is a bit better than the majority class which is 70% (30% of non-accuracy).
mean(data$high_use != (predictions > 0.5))
## [1] 0.2351351
# to conclude, my model is a bit better than using the prediction based on the majority class. But it should be improved for better accuracy!

Bonus:

I had to look at chat gpt fully for this one. I did not know the method. I now understand this method splits the data in 10 data samples and will use 9 of them to predict the last one, rotating to test them all. In that case, the average of prediction for the different tests is 74.9% which is better than the 70% (majority class).

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(lattice)
# Define the model formula
formula <- factor(high_use) ~ goout + famrel + studytime

# Define the number of folds for cross-validation
num_folds <- 10

# Create a training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = num_folds)

# Train the model using 10-fold cross-validation
model <- train(formula, data = data, method = "glm", family = binomial, trControl = train_control)

# Get cross-validated results
results <- model$resample
str(results)
## 'data.frame':    10 obs. of  3 variables:
##  $ Accuracy: num  0.622 0.649 0.784 0.919 0.816 ...
##  $ Kappa   : num  -0.0117 0.0873 0.4219 0.7894 0.5199 ...
##  $ Resample: chr  "Fold01" "Fold02" "Fold03" "Fold04" ...
# Calculate mean prediction error
mean(results$Accuracy)
## [1] 0.7483958